Trying out community detection. At this point can only do this on the STRING based weighted edges generated in this notebook.


In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
# Some nice default configuration for plots
plt.rcParams['figure.figsize'] = 10, 7.5
plt.rcParams['axes.grid'] = True
plt.gray()


<matplotlib.figure.Figure at 0x30f9ad0>

In [2]:
import pandas as pd

In [3]:
cd ../../forGAVIN/mergecode/OUT/


/data/opencast/MRes/forGAVIN/mergecode/OUT

In [4]:
!git annex unlock ../../../HBP/testdata/edgelist_update_weighted.txt


unlock ../../../HBP/testdata/edgelist_update_weighted.txt (copying...) ok

In [5]:
cp edgelist_update_weighted.txt ../../../HBP/testdata/

In [6]:
cd ../../../HBP/


/data/opencast/MRes/HBP

Community detection code from Colin written in C++:


In [292]:
!make


Building with option-useomp
g++     -DUSEOMP -O3 -fopenmp -fomit-frame-pointer -funroll-loops -fforce-addr -fexpensive-optimizations -Wno-deprecated -g3 -o run communityDetection.C 

This is how to run it:


In [7]:
!./run


./run requires 2 arguments: 
-file         : the network file to run 
-quite        : Turn on/off comments to command line
              : 0 = switch on comments to command line
              : 1 = switch off comments to command line
-algorithm    : the type of algorithm to run
              : 1 = Geodesic edge Betweenness
              : 2 = Random edge Betweenness
              : 3 = Spectral Betweenness
-weighted     : specify if network file is *weighted or not: 
              : 0  = Using a non-weighted network file 
              : 1  = Using a     weighted network file 
-enrichment   : run the enrichment on the clustered network
              : 0 = run enrichment after clustering - Default
              : 1 = run enrichment without clustering
              : 2 = do not run enrichment
-----
*             : The structure of the network file is:
-weighted 1   : A (interactor) 	 B (interactor) 	 W (weight) 
-weighted 0   : A (interactor) 	 B (interactor) 
              : Where A and B are integers (Entrez IDs) and W is a double.
EXAMPLE 1
-----
./run -file testData/edgelist.txt -weighted 0 -algorithm 1
-----
EXAMPLE 2
-----
./run -file testData/karate.txt -weighted 1 -algorithm 1
-----

Output is written to the OUT file, so we have to unlock them first.


In [7]:
!git annex unlock OUT/*


unlock OUT/communities_cytoscape.txt (copying...) ok
unlock OUT/community-metric_nmi.txt (copying...) ok
unlock OUT/communityout.txt (copying...) ok
unlock OUT/consensusout.txt (copying...) ok
unlock OUT/enrichment_SCH_enriched.csv (copying...) ok
unlock OUT/enrichment_disease.csv (copying...) ok
unlock OUT/network.png (copying...) ok
unlock OUT/overlap_dis_SCHenriched.csv (copying...) ok
unlock OUT/p_significance_tests_SCH_enriched.csv (copying...) ok
unlock OUT/p_significance_tests_disease.csv (copying...) ok
unlock OUT/p_significance_tests_summary_SCH_enriched.csv (copying...) ok
unlock OUT/p_significance_tests_summary_disease.csv (copying...) ok
unlock OUT/p_value_network_SCH_enriched.csv (copying...) ok
unlock OUT/p_value_network_disease.csv (copying...) ok
unlock OUT/p_values_SCH_enriched.csv (copying...) ok
unlock OUT/p_values_disease.csv (copying...) ok
unlock OUT/permute_p_values_SCH_enriched.csv (copying...) ok
unlock OUT/permute_p_values_disease.csv (copying...) ok
unlock OUT/removededges.txt (copying...) ok

Unweighted case

Running the code on the unweighted case to begin with:


In [8]:
!./run -file testdata/edgelist_update_weighted.txt -weighted 0 -algorithm 3 -quite 1


> ---- START ---- 
> Using Spectral Betweenness.
> Using a non-weighted network 
> Scanning network_file: testdata/edgelist_update_weighted.txt
> Header: None 
> Running Spectral Betweenness.
--- DONE ---
> cputime: 275.53 seconds 
> Network (nodes): 1572 (edges): 9373
> ---- END ---- 

This file describes the communities produced:


In [9]:
!head -n 10 OUT/communityout.txt


communityout
Max Q: 0.36175
cputime: 275.53 seconds 
Network (nodes): 1572 (edges): 9373
we 0 dist 0 community 1 Degree 1 visited 0 key 1315 (55268)
community: 1 size: 1
we 0 dist 0 community 2 Degree 4 visited 0 key 3 (18)
we 0 dist 0 community 2 Degree 5 visited 0 key 50 (372)
we 0 dist 0 community 2 Degree 5 visited 0 key 62 (481)
we 0 dist 0 community 2 Degree 2 visited 0 key 63 (482)

Weighted case

This is the case using the weights generated using the various data sources available:


In [10]:
!git annex unlock unweightedOUT/*


unlock unweightedOUT/communities_cytoscape.txt (copying...) ok
unlock unweightedOUT/community-metric_nmi.txt (copying...) ok
unlock unweightedOUT/communityout.txt (copying...) ok
unlock unweightedOUT/consensusout.txt (copying...) ok
unlock unweightedOUT/enrichment_SCH_enriched.csv (copying...) ok
unlock unweightedOUT/enrichment_disease.csv (copying...) ok
unlock unweightedOUT/network.png (copying...) ok
unlock unweightedOUT/overlap_dis_SCHenriched.csv (copying...) ok
unlock unweightedOUT/p_significance_tests_SCH_enriched.csv (copying...) ok
unlock unweightedOUT/p_significance_tests_disease.csv (copying...) ok
unlock unweightedOUT/p_significance_tests_summary_SCH_enriched.csv (copying...) ok
unlock unweightedOUT/p_significance_tests_summary_disease.csv (copying...) ok
unlock unweightedOUT/p_value_network_SCH_enriched.csv (copying...) ok
unlock unweightedOUT/p_value_network_disease.csv (copying...) ok
unlock unweightedOUT/p_values_SCH_enriched.csv (copying...) ok
unlock unweightedOUT/p_values_disease.csv (copying...) ok
unlock unweightedOUT/permute_p_values_SCH_enriched.csv (copying...) ok
unlock unweightedOUT/permute_p_values_disease.csv (copying...) ok
unlock unweightedOUT/removededges.txt (copying...) ok

In [11]:
cp OUT/* unweightedOUT/

In [12]:
!./run -file testdata/edgelist_update_weighted.txt -weighted 1 -algorithm 3 -quite 1


> ---- START ---- 
> Using Spectral Betweenness.
> Using a weighted network 
> Scanning network_file: testdata/edgelist_update_weighted.txt
> Header: None 
> Running Spectral Betweenness.
--- DONE ---
> cputime: 279.53 seconds 
> Network (nodes): 1572 (edges): 9373
> ---- END ---- 

Normalised Mutual Information test

We can provide a measure of the similarity of the communities detected using a Normalised Mutual Information test. As wikipedia puts it:

mutual information measures the information that X and Y share: it measures how much knowing one of these variables reduces uncertainty about the other.

Scikit-learn provides this test, all that's required is to load the data into Python.


In [13]:
import csv

In [14]:
def parsecommunityout(fname):
    f = open(fname)
    c = csv.reader(f,delimiter=" ")
    communitydict = {}
    for l in c:
        if l[0] == "we":
            communitydict[l[-1].strip("()")] = int(l[5])
    sortedids = sorted(communitydict.keys(),key=int)
    communities = map(communitydict.__getitem__,sortedids)
    return sortedids,communities

In [15]:
wids,wcom = parsecommunityout("OUT/communityout.txt")

In [16]:
uwids,uwcom = parsecommunityout("unweightedOUT/communityout.txt")

In [17]:
import sklearn.metrics

In [18]:
sklearn.metrics.normalized_mutual_info_score(wcom,uwcom)


Out[18]:
0.60407676187875536

But, this result might not be correct, as the class labels sorting the nodes into communities are arbitrary.

Again, the below code is not mine:


In [19]:
from __future__ import division
import collections, re, math, sys, csv

In [20]:
#--- helper function
def convertStr(s):
    ret = int(s)
    return ret

In [21]:
rows, new, dic = [], [], []
consensus = "OUT/communities_cytoscape.txt"
community = "unweightedOUT/communities_cytoscape.txt"

#--- readin first community file
with open(consensus, 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter='=')
    headerline = reader.next();
    for row in reader:
        new.append(row)

#--- readin second community file
with open(community, 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter='=')
    headerline = reader.next();
    for row in reader:
        dic.append(row)

In [22]:
!git annex unlock OUT/community-metric_nmi.txt

In [23]:
for i in dic:
    for j in new:
        if i[0] == j[0]:
            i.append(j[1])
dic.sort(key=lambda x: int(x[2]))

#--- count of nodes in each community in reality and in algorithm
realc, algc = [], []
for i in dic:
    realc.append(convertStr(i[1]))
    algc.append(convertStr(i[2]))
comcount = dict([(i,realc.count(i)) for i in set(realc)])
pcount = dict([(i,algc.count(i)) for i in set(algc)])

#--- holds the algorithm community label with the corresponding community labels
#    provided e.g., {1:{2:2,3:3}}
algcount, tempd = {}, {}
templist = []
comtrack = 0
for key, val in pcount.iteritems():
    templist = realc[comtrack:comtrack+val]
    comtrack += val
    for i in set(templist):
        tempd[i] = templist.count(i)
    algcount[key] = tempd
    tempd = {}

#--- finding the node with the maximum label in that community
#    key = int, value = dictionary
label = {}
for key, value in algcount.iteritems():
    label[key] = max(value, key = value.get)
for i in dic:
    for j in label:
        if (int)(i[2]) == j:
            i.append(label[j])

#--- METRICS    
#--- Calculate the NMI
nt, np, ntp = {}, {}, {}
NMI = 0
nodes = len(dic)
for i in range(1, max(comcount)+1):
    for j in range(1, max(pcount)+1):
        nt[i] = 0
        np[j] = 0
        ntp[(str(i)+' '+str(j))] = 0
nt = comcount
np = pcount
for i in dic:
    ntp[(str(i[1]).strip()+' '+str(i[2]).strip())] += 1
num, den = 0, 0
for i, t in nt.iteritems():
    for j, p in np.iteritems():
        temp, temp2 = 0, 0
        if (ntp[(str(i).strip()+' '+str(j).strip())] > 0) and (t > 0) and (p > 0):
            temp = ((ntp[(str(i).strip()+' '+str(j).strip())]*nodes)/(t*p))
            temp2 = ntp[(str(i).strip()+' '+str(j).strip())]* math.log(temp)
        num += temp2
d1, d2 = 0, 0
for t in nt.values():
        d1 += t * math.log(t/nodes)
for p in np.values():
        d2 += p * math.log(p/nodes)
den = d1 + d2
NMI = -(2*num)/den


#--- print results to screen
print 'NMI = %s' %(NMI)
f = open ( 'OUT/community-metric_nmi.txt', 'w' )
f.writelines("node\treal\talg\tnewlabel\n")
f.writelines("%s\t%s\t%s\t%s\n" % (i[0],i[1],i[2],i[3]) for i in dic)
f.writelines('NMI = %s' %(NMI))
f.close()


NMI = 0.604022794092

So the result is almost exactly the same, but the above code appears to take into account the problem with arbitrary community labels. Good to see both methods getting similar results, in any case.

Disease Enrichment

Having a look at the disease enrichment now:


In [24]:
dew =  pd.read_csv("OUT/permute_p_values_disease.csv", delimiter="\t")
deuw = pd.read_csv("unweightedOUT/permute_p_values_disease.csv", delimiter="\t")

Check papers to read about the hypergeometric disease enrichment test:


In [25]:
dew.head()


Out[25]:
Community Size disease_of_metabolism p-value {p-value} Sig. Level p_lower (%) p_upper (%) disease_by_infectious_agent p-value.1 ... Sig. Level.135 p_lower (%).135 p_upper (%).135 muscular_dystrophy p-value.136 {p-value}.136 Sig. Level.136 p_lower (%).136 p_upper (%).136 Unnamed: 824
0 1 50 NaN 0.125016 0.541943 0.010459 12.5 93.0 NaN 0.837946 ... 0.015763 45.7 89.0 NaN 0.598702 0.659976 0.015500 59.9 76.4 NaN
1 2 48 NaN 0.761838 0.562907 0.013470 74.6 39.4 NaN 0.989202 ... 0.000000 100.0 55.6 NaN 0.051262 0.671740 0.006974 5.2 99.6 NaN
2 3 26 NaN 0.574356 0.576412 0.015636 58.3 64.3 NaN 0.406454 ... 0.014087 28.9 95.9 NaN 1.000000 0.731936 0.000000 100.0 61.4 NaN
3 4 48 NaN 0.877369 0.564046 0.010373 88.7 23.8 NaN 0.033306 ... 0.015723 43.4 88.9 NaN 0.009142 0.668188 0.003010 1.1 99.7 NaN
4 5 26 NaN 0.574356 0.610789 0.015636 52.1 70.0 NaN 0.812738 ... 0.000000 100.0 73.9 NaN 0.375665 0.743471 0.015315 37.1 91.9 NaN

5 rows × 825 columns


In [26]:
deuw.head()


Out[26]:
Community Size disease_of_metabolism p-value {p-value} Sig. Level p_lower (%) p_upper (%) disease_by_infectious_agent p-value.1 ... Sig. Level.135 p_lower (%).135 p_upper (%).135 muscular_dystrophy p-value.136 {p-value}.136 Sig. Level.136 p_lower (%).136 p_upper (%).136 Unnamed: 824
0 1 1 NaN 1.000000 0.880437 0.000000 100.0 85.9 NaN 1.000000 ... 0.000000 100.0 98.9 NaN 1.000000 0.983303 0.000000 100 98.3 NaN
1 2 99 NaN 0.765480 0.529656 0.013399 76.6 32.3 NaN 0.970181 ... 0.005178 2.3 99.6 NaN 0.535656 0.606840 0.015771 55 74.1 NaN
2 3 37 NaN 0.684769 0.560950 0.014692 69.7 48.4 NaN 0.009547 ... 0.015230 37.4 93.7 NaN 1.000000 0.686668 0.000000 100 49.4 NaN
3 4 17 NaN 0.490462 0.594665 0.015808 48.8 73.4 NaN 0.783821 ... 0.012346 19.0 98.3 NaN 1.000000 0.784876 0.000000 100 71.6 NaN
4 5 17 NaN 0.940347 0.603658 0.007490 94.1 24.9 NaN 0.949553 ... 0.000000 100.0 79.9 NaN 1.000000 0.797138 0.000000 100 73.2 NaN

5 rows × 825 columns

How to interpret these results? I suppose we would like to split it by the different diseases and then detect those with low p-values, probably by sorting them. All of the disease columns are regularly spaced, so should be easy to split into different dataframes:


In [27]:
def splitbydisease(diseasedataframe):
    diseasedict = {}
    clabels = diseasedataframe.iloc[:,[0,1,3,4,5,6,7]].columns
    for offset in range(2,diseasedataframe.columns.shape[0]-1,6):
        diseaseindexes = [0,1] + range(offset+1,offset+6)
        diseasedict[diseasedataframe.columns[offset]]=diseasedataframe.iloc[:,diseaseindexes]
        #fix the column names
        diseasedict[diseasedataframe.columns[offset]].columns = clabels
    return diseasedict

In [28]:
deuwdict = splitbydisease(deuw)

In [29]:
dewdict = splitbydisease(dew)

Now we can search for diseases in each of these communities with p-values less than 5%.


In [30]:
def significantdiseases(diseasedict):
    sigdiseasedict = {}
    for k in diseasedict:
        pvalues = diseasedict[k].iloc[:,2]
        pvalues = pvalues[pvalues<0.05]
        sigdiseasedict[k] = diseasedict[k].iloc[pvalues.index]
    return sigdiseasedict

In [31]:
sigdeuwdict = significantdiseases(deuwdict)

In [32]:
sigdewdict = significantdiseases(dewdict)

Now we want to iterate through the diseases and sort the lowest p-values, while annotating which disease and community is involved. Should probably build a new dataframe and put this information in it.

To keep the analysis simple, we only want to consider schizophrenia and alzheimers.


In [33]:
interestdiseases = ["Alzheimer's_disease",'schizophrenia']

In [34]:
for k in sigdeuwdict:
    sigdeuwdict[k].insert(2,'disease',k)
pieces = map(lambda k: sigdeuwdict[k],interestdiseases)

In [35]:
sigunweighted = pd.concat(pieces)

In [36]:
for k in sigdewdict:
    sigdewdict[k].insert(2,'disease',k)
pieces = map(lambda k: sigdewdict[k],interestdiseases)

In [37]:
sigweighted = pd.concat(pieces)

In [38]:
sigunweighted.sort(columns='p-value')


Out[38]:
Community Size disease p-value {p-value} Sig. Level p_lower (%) p_upper (%)
28 29 88 schizophrenia 0.001691 0.549829 0.001299 0.2 99.9
1 2 99 schizophrenia 0.004321 0.547038 0.002074 0.5 99.9
29 30 26 Alzheimer's_disease 0.005374 0.597882 0.002312 0.7 99.8
60 61 24 schizophrenia 0.019591 0.562575 0.004383 1.5 99.8
62 63 23 Alzheimer's_disease 0.035030 0.573961 0.005814 4.3 98.2
60 61 24 Alzheimer's_disease 0.042531 0.593381 0.006381 4.5 98.2
62 63 23 schizophrenia 0.046159 0.589334 0.006635 4.3 98.5

In [39]:
sigweighted.sort(columns='p-value')


Out[39]:
Community Size disease p-value {p-value} Sig. Level p_lower (%) p_upper (%)
43 44 35 schizophrenia 0.003446 0.564032 0.001853 0.6 99.7
44 45 73 schizophrenia 0.003900 0.528577 0.001971 0.4 99.9
47 48 17 Alzheimer's_disease 0.007570 0.619867 0.002741 0.2 100.0
32 33 64 Alzheimer's_disease 0.008390 0.563842 0.002884 0.7 99.5
32 33 64 schizophrenia 0.023347 0.541118 0.004775 2.2 98.8
13 14 18 schizophrenia 0.041873 0.590369 0.006334 3.2 99.5
1 2 48 schizophrenia 0.046719 0.563059 0.006674 4.9 98.0
14 15 8 Alzheimer's_disease 0.049793 0.666429 0.006879 5.2 99.2

Now all that's required is to retreive the IDs of the proteins in each of these communities to compare the different disease associated communities. Wait, Cytoscape already does that for me. I can just use that.

Writing these results to a file so I can discuss them later:


In [40]:
!git annex unlock unweighted_significant_disease_communities.tsv


unlock unweighted_significant_disease_communities.tsv (copying...) ok

In [41]:
sigunweighted.sort(columns='p-value').to_csv("unweighted_significant_disease_communities.tsv",
                                             sep="\t",index=False)

In [42]:
!git annex unlock weighted_significant_disease_communities.tsv


unlock weighted_significant_disease_communities.tsv (copying...) ok

In [43]:
sigweighted.sort(columns='p-value').to_csv("weighted_significant_disease_communities.tsv",
                                           sep="\t",index=False)

In [44]:
!head unweighted_significant_disease_communities.tsv


Community	Size	disease	p-value	{p-value}	Sig. Level	p_lower (%)	p_upper (%)
29	88	schizophrenia	0.00169105	0.549829	0.0012993	0.2	99.9
2	99	schizophrenia	0.00432129	0.547038	0.00207427	0.5	99.9
30	26	Alzheimer's_disease	0.00537398	0.597882	0.00231195	0.7	99.8
61	24	schizophrenia	0.019590700000000003	0.562575	0.00438256	1.5	99.8
63	23	Alzheimer's_disease	0.0350298	0.5739609999999999	0.00581401	4.3	98.2
61	24	Alzheimer's_disease	0.0425309	0.593381	0.00638138	4.5	98.2
63	23	schizophrenia	0.046159399999999996	0.589334	0.00663542	4.3	98.5

Crossover between weighted and unweighted

The obvious question to ask is what is the crossover between the high confidence disease communities in each of these groupings. We can find this by comparing the members of each of these communities. While doing the NMI test we loaded in these communities, so we can access them and compare them easily.


In [45]:
weightedcommunities = {}
for l in zip(*parsecommunityout("OUT/communityout.txt")):
    try:
        weightedcommunities[l[1]] += [l[0]]
    except KeyError:
        weightedcommunities[l[1]] = [l[0]]

In [46]:
unweightedcommunities = {}
for l in zip(*parsecommunityout("unweightedOUT/communityout.txt")):
    try:
        unweightedcommunities[l[1]] += [l[0]]
    except KeyError:
        unweightedcommunities[l[1]] = [l[0]]

Unweighted community 29

The most likely disease association in the above unweighted table is community 29 with a size of 88. Comparing this to find the closest match in the weighted case:


In [47]:
def findsimilarcom(community,communitydict):
    overlapdict = {}
    for k in communitydict:
        #make two lists of inclusion/exclusion for all shared ids
        uwids = community
        wids = communitydict[k]
        tids = set(uwids+wids)
        winc,uwinc = [],[]
        for i in tids:
            winc.append(int(i in wids))
            uwinc.append(int(i in uwids))
        overlap = sum(map(lambda a,b: int(a==b), uwinc,winc))/len(tids)
        if overlap > 0.0:
            overlapdict[k] = overlap
    return overlapdict

In [48]:
findsimilarcom(unweightedcommunities[29],weightedcommunities)


Out[48]:
{2: 0.014925373134328358,
 7: 0.0086956521739130436,
 13: 0.0058479532163742687,
 24: 0.031914893617021274,
 26: 0.0059880239520958087,
 30: 0.011363636363636364,
 31: 0.011363636363636364,
 32: 0.019047619047619049,
 33: 0.44761904761904764,
 40: 0.052631578947368418,
 41: 0.008771929824561403,
 43: 0.011363636363636364,
 44: 0.0081967213114754103,
 45: 0.080536912751677847,
 46: 0.030534351145038167,
 48: 0.0096153846153846159,
 54: 0.02,
 83: 0.008771929824561403}

So many of these proteins are spread around smaller other communities in the weighted case, but most are found in another community which overlaps 31%.


In [49]:
len(weightedcommunities[33]),len(unweightedcommunities[29])


Out[49]:
(64, 88)

Investigating the effect of the weights. Would like to know which weighting might have caused the community to be split like this. We can inspect the graphs more easily by plotting them:


In [50]:
#load the edges of the graph
interactions = loadtxt("testdata/edgelist_update_weighted.txt",dtype=str)
interactions = interactions[1:]

In [51]:
import networkx as nx

In [52]:
def plotcommunities(com1,com2):
    G = nx.Graph()
    for l in interactions:
        if l[0] in set(com1+com2) and l[1] in set(com1+com2):
            G.add_edge(l[0],l[1],weight=float(l[2]))
    edict = {}
    lim = min([d['weight'] for (u,v,d) in G.edges(data=True)])
    diff = linspace(lim,1.0,10)[1]- linspace(lim,1.0,10)[0]
    for x in linspace(lim,1.0,10):
        edict[x] = [(u,v) for (u,v,d) in G.edges(data=True) if d['weight'] > x and d['weight'] < x+diff]
    pos = nx.circular_layout(com1)
    pos2 = nx.circular_layout(set(com2)-set(com1))
    for k in pos2:
        pos2[k] = array([pos2[k][0]+1.5,pos2[k][1]])
    pos = dict(pos.items()+pos2.items())
    nx.draw_networkx_nodes(G,pos,node_size=20,alpha=0.5)
    for k in edict:
        nx.draw_networkx_edges(G,pos,edgelist=edict[k],alpha=(k-lim)*(1/(1-lim)),edge_color='r')
        #nx.draw_networkx_edges(G,pos,edgelist=edict[k],edge_color='r')
    l=nx.draw_networkx_labels(G,pos,font_size=10,font_family='sans-serif')
    return None

In [53]:
plotcommunities(weightedcommunities[33],unweightedcommunities[29])



In [80]:
import os

In [81]:
plotpath = os.path.abspath("../plots/bayes/")

In [82]:
savez(os.path.join(plotpath,"nx2933.npz"),unweightedcommunities[29],weightedcommunities[33])

In [54]:
plotcommunities(unweightedcommunities[29],weightedcommunities[33])


Average weighting of the edges within each community:


In [56]:
def averageweight(community):
    weights = []
    for l in interactions:
        if l[0] in community and l[1] in community:
            weights.append(float(l[2]))
    return average(weights)

In [57]:
averageweight(weightedcommunities[33])


Out[57]:
0.89165857738239485

In [58]:
averageweight(unweightedcommunities[29])


Out[58]:
0.89599059455826779

Weighted community 44

Now investigating a high-likelihood community from the weighted case:


In [59]:
s=findsimilarcom(weightedcommunities[44],unweightedcommunities)
s


Out[59]:
{2: 0.015151515151515152,
 29: 0.0081967213114754103,
 30: 0.016666666666666666,
 63: 0.11538461538461539,
 64: 0.48717948717948717,
 69: 0.029999999999999999,
 77: 0.011627906976744186,
 84: 0.019230769230769232}

So the proteins involved are split amount some larger communities in the unweighted case, mostly in community 6. We can plot these:


In [60]:
plotcommunities(weightedcommunities[44],unweightedcommunities[64])



In [83]:
savez(os.path.join(plotpath,"nx6444.npz"),unweightedcommunities[64],weightedcommunities[44])

In [61]:
plotcommunities(unweightedcommunities[64],weightedcommunities[44])


Schizophrenia functional annotations

The Disease Enrichment test also produces a similar table annotating functions that are associated with Schizophrenia. We can repeat the above parsing to find the likely ones of this:


In [62]:
schw = pd.read_csv("OUT/permute_p_values_SCH_enriched.csv", delimiter="\t")
schuw = pd.read_csv("unweightedOUT/permute_p_values_SCH_enriched.csv", delimiter="\t")

In [63]:
sigschwdict = significantdiseases(splitbydisease(schw))
sigschuwdict = significantdiseases(splitbydisease(schuw))

In [64]:
for k in sigschwdict:
    sigschwdict[k].insert(2,'disease',k)
pieces = map(lambda k: sigschwdict[k],sigschwdict)

In [65]:
sigschweighted = pd.concat(pieces)

In [66]:
for k in sigschuwdict:
    sigschuwdict[k].insert(2,'disease',k)
pieces = map(lambda k: sigschuwdict[k],sigschuwdict)

In [67]:
sigschunweighted = pd.concat(pieces)

In [68]:
sigschweighted.sort(columns='p-value')


Out[68]:
Community Size disease p-value {p-value} Sig. Level p_lower (%) p_upper (%)
25 26 80 Endocytosis -6.661340e-16 0.638668 NaN 0.0 100.0
44 45 73 Exocytosis 2.220450e-16 0.607270 4.712160e-10 0.0 100.0
39 40 32 CAT_signaling 6.416100e-10 0.672824 8.010050e-07 0.0 100.0
45 46 47 Exocytosis 6.159730e-09 0.632842 2.481880e-06 0.0 100.0
45 46 47 Intracellular_trafficking 1.953090e-07 0.663362 1.397530e-05 0.0 100.0
23 24 9 CAT_signaling 6.450880e-07 0.837461 2.539860e-05 0.0 100.0
32 33 64 Excitability 1.881530e-06 0.694875 4.337650e-05 0.0 100.0
53 54 14 Intracellular_trafficking 5.222120e-06 0.803925 7.226400e-05 0.0 100.0
44 45 73 Neurotransmitter_metabolism 9.475210e-06 0.763295 9.734020e-05 0.0 100.0
7 8 53 Cell_metabolism 1.138280e-05 0.659110 1.066900e-04 0.0 100.0
3 4 48 Structural_plasticity 1.576900e-05 0.603017 1.255740e-04 0.0 100.0
39 40 32 LGIC_signaling 1.663450e-05 0.856139 1.289740e-04 0.0 100.0
1 2 48 Protein_cluster 4.553190e-05 0.654236 2.133770e-04 0.0 100.0
32 33 64 Unknown 4.553640e-05 0.686648 2.133880e-04 0.0 100.0
77 78 15 G-protein_relay 8.115600e-05 0.900729 2.848670e-04 0.0 100.0
45 46 47 Ion_balance/transport 1.457220e-04 0.671435 3.817070e-04 0.0 100.0
46 47 5 G-protein_relay 3.607280e-04 0.970943 6.004980e-04 0.0 100.0
49 50 19 Protein_cluster 5.627940e-04 0.762330 7.499850e-04 0.0 100.0
12 13 84 RPSFB 6.244790e-04 0.619736 7.899930e-04 0.0 100.0
44 45 73 Protein_cluster 6.846600e-04 0.601769 8.271590e-04 0.2 100.0
44 45 73 CAT_signaling 1.170980e-03 0.624930 1.081490e-03 0.1 100.0
43 44 35 GPCR_signaling 1.425310e-03 0.941988 1.193010e-03 0.1 100.0
36 37 2 Intracellular_Signal_Transduction 2.369600e-03 0.901327 1.537530e-03 0.1 100.0
5 6 45 Cell_metabolism 2.620530e-03 0.644941 1.616680e-03 0.2 100.0
1 2 48 G-protein_relay 2.759550e-03 0.794211 1.658900e-03 0.2 100.0
17 18 120 Structural_plasticity 3.187110e-03 0.565784 1.782400e-03 0.3 100.0
23 24 9 Excitability 3.791720e-03 0.920861 1.943540e-03 0.3 100.0
32 33 64 Protein_cluster 1.065260e-02 0.631767 3.246400e-03 1.3 99.7
29 30 1 Excitability 1.081420e-02 0.985162 3.270670e-03 1.5 100.0
42 43 1 Excitability 1.081420e-02 0.988130 3.270670e-03 1.2 100.0
17 18 120 Intracellular_Signal_Transduction 1.149320e-02 0.550901 3.370620e-03 1.4 99.6
14 15 8 Cell_metabolism 1.240010e-02 0.865700 3.499480e-03 1.0 99.7
52 53 31 Intracellular_Signal_Transduction 1.544150e-02 0.630677 3.899110e-03 1.7 99.8
43 44 35 LGIC_signaling 1.572140e-02 0.842228 3.933730e-03 1.6 99.9
34 35 3 LGIC_signaling 1.708820e-02 0.985256 4.098320e-03 1.5 100.0
52 53 31 Ion_balance/transport 1.793900e-02 0.705077 4.197290e-03 2.0 99.9
18 19 12 Ion_balance/transport 1.934710e-02 0.833093 4.355780e-03 1.7 100.0
73 74 1 Cell_metabolism 2.226460e-02 0.978490 4.665720e-03 2.2 100.0
38 39 2 Unknown 2.277690e-02 0.977524 4.717850e-03 2.3 100.0
4 5 26 RPSFB 2.309210e-02 0.715671 4.749620e-03 1.8 99.9
1 2 48 CAT_signaling 2.411220e-02 0.637163 4.850860e-03 1.7 99.9
43 44 35 Ion_balance/transport 2.486020e-02 0.711774 4.923630e-03 1.7 99.8
43 44 35 Intracellular_Signal_Transduction 2.532140e-02 0.610452 4.967920e-03 2.1 99.5
15 16 17 Structural_plasticity 3.152940e-02 0.672475 5.525880e-03 4.3 99.8
32 33 64 Intracellular_Signal_Transduction 3.348920e-02 0.575023 5.689260e-03 3.8 98.7
12 13 84 Cell_metabolism 3.535680e-02 0.614940 5.840100e-03 3.4 99.3
49 50 19 Tyrosine_kinase_signaling 3.584560e-02 0.960434 5.878830e-03 4.1 99.9
5 6 45 Structural_plasticity 3.724010e-02 0.592989 5.987760e-03 3.8 99.4
47 48 17 Ion_balance/transport 3.765970e-02 0.804750 6.020090e-03 3.5 99.8
21 22 1 Structural_plasticity 4.198470e-02 0.966469 6.342080e-03 3.5 100.0
81 82 17 Intracellular_Signal_Transduction 4.681000e-02 0.671723 6.679730e-03 5.7 99.1
61 62 1 Intracellular_Signal_Transduction 4.898220e-02 0.950547 6.825170e-03 5.2 100.0
72 73 1 Intracellular_Signal_Transduction 4.898220e-02 0.956253 6.825170e-03 4.6 100.0

In [69]:
sigschunweighted.sort(columns='p-value')


Out[69]:
Community Size disease p-value {p-value} Sig. Level p_lower (%) p_upper (%)
1 2 99 Exocytosis -6.661340e-16 0.596752 NaN 0.0 100.0
68 69 68 Endocytosis -4.440890e-16 0.691935 NaN 0.0 100.0
60 61 24 G-protein_relay 8.690600e-12 0.873672 9.322340e-08 0.0 100.0
28 29 88 Excitability 8.561880e-11 0.662133 2.926070e-07 0.0 100.0
58 59 31 CAT_signaling 4.482160e-10 0.689979 6.694890e-07 0.0 100.0
72 73 13 Protein_cluster 1.814150e-09 0.816884 1.346900e-06 0.0 100.0
1 2 99 Intracellular_trafficking 2.364650e-07 0.597795 1.537740e-05 0.0 100.0
28 29 88 Protein_cluster 8.989080e-07 0.615212 2.998180e-05 0.0 100.0
16 17 5 Structural_plasticity 1.376690e-05 0.845397 1.173320e-04 0.0 100.0
3 4 17 Cell_metabolism 2.083410e-05 0.764369 1.443390e-04 0.0 100.0
6 7 92 RPSFB 3.503870e-05 0.599114 1.871830e-04 0.0 100.0
1 2 99 Neurotransmitter_metabolism 4.320860e-05 0.722620 2.078620e-04 0.0 100.0
69 70 3 Structural_plasticity 7.081230e-05 0.904585 2.660960e-04 0.0 100.0
61 62 24 Unknown 1.111100e-04 0.821977 3.333130e-04 0.0 100.0
71 72 44 CAT_signaling 4.116570e-04 0.656179 6.414730e-04 0.0 100.0
15 16 49 Structural_plasticity 7.192340e-04 0.580705 8.477720e-04 0.0 100.0
28 29 88 LGIC_signaling 9.317240e-04 0.737109 9.648090e-04 0.0 100.0
27 28 19 Unknown 1.090500e-03 0.831836 1.043700e-03 0.0 100.0
54 55 2 Intracellular_Signal_Transduction 2.369600e-03 0.914893 1.537530e-03 0.1 100.0
58 59 31 Intracellular_Signal_Transduction 3.135030e-03 0.626750 1.767820e-03 0.1 99.9
22 23 71 Cell_metabolism 3.932760e-03 0.616346 1.979220e-03 0.0 100.0
62 63 23 Intracellular_Signal_Transduction 4.158490e-03 0.645144 2.034990e-03 0.2 100.0
30 31 3 Tyrosine_kinase_signaling 5.717900e-03 0.996023 2.384370e-03 0.4 100.0
26 27 33 CAT_signaling 6.526000e-03 0.688906 2.546260e-03 0.4 100.0
62 63 23 LGIC_signaling 6.928630e-03 0.884490 2.623100e-03 0.7 100.0
4 5 17 RPSFB 7.016560e-03 0.756378 2.639570e-03 0.5 100.0
62 63 23 Ion_balance/transport 7.804610e-03 0.726029 2.782750e-03 0.5 100.0
61 62 24 Ion_balance/transport 8.809520e-03 0.747960 2.954980e-03 1.4 100.0
58 59 31 LGIC_signaling 1.243410e-02 0.888333 3.504200e-03 0.9 99.9
67 68 1 Endocytosis 1.335880e-02 0.987174 3.630470e-03 1.3 100.0
28 29 88 CAT_signaling 1.478950e-02 0.600355 3.817170e-03 1.0 99.8
32 33 3 LGIC_signaling 1.708820e-02 0.981325 4.098320e-03 1.9 100.0
60 61 24 CAT_signaling 1.728890e-02 0.720771 4.121890e-03 1.5 100.0
77 78 11 Intracellular_trafficking 2.088610e-02 0.834183 4.522150e-03 1.5 100.0
60 61 24 Exocytosis 2.133330e-02 0.709442 4.569270e-03 1.8 99.9
47 48 1 Cell_metabolism 2.226460e-02 0.981423 4.665720e-03 1.9 100.0
82 83 1 Cell_metabolism 2.226460e-02 0.976534 4.665720e-03 2.4 100.0
34 35 2 Unknown 2.277690e-02 0.976547 4.717850e-03 2.4 100.0
13 14 95 RPSFB 2.376770e-02 0.608652 4.816930e-03 2.2 99.6
66 67 1 Exocytosis 2.544530e-02 0.972712 4.979740e-03 2.8 100.0
19 20 87 Structural_plasticity 2.600320e-02 0.567603 5.032600e-03 2.6 98.7
22 23 71 Structural_plasticity 2.619760e-02 0.574707 5.050870e-03 2.1 99.3
13 14 95 Intracellular_Signal_Transduction 3.836680e-02 0.568902 6.074110e-03 4.3 98.0
14 15 16 Intracellular_Signal_Transduction 3.992310e-02 0.693810 6.191060e-03 3.2 99.8
62 63 23 GPCR_signaling 4.328120e-02 0.964559 6.434900e-03 3.7 99.9
63 64 23 GPCR_signaling 4.328120e-02 0.963602 6.434900e-03 3.8 99.9
68 69 68 Neurotransmitter_metabolism 4.361520e-02 0.783897 6.458550e-03 4.1 99.7
60 61 24 GPCR_signaling 4.513410e-02 0.959851 6.564830e-03 4.2 99.9
14 15 16 Cell_metabolism 4.753880e-02 0.776354 6.728960e-03 3.3 99.9
43 44 1 Intracellular_Signal_Transduction 4.898220e-02 0.942939 6.825170e-03 6.0 100.0
40 41 1 Intracellular_Signal_Transduction 4.898220e-02 0.943890 6.825170e-03 5.9 100.0

In [72]:
!git annex unlock unweighted_significant_SCH_communities.tsv


unlock unweighted_significant_SCH_communities.tsv (copying...) ok

In [73]:
!git annex unlock weighted_significant_SCH_communities.tsv


unlock weighted_significant_SCH_communities.tsv (copying...) ok

In [74]:
#save these tables
sigschunweighted.sort(columns='p-value').to_csv("unweighted_significant_SCH_communities.tsv",
                                             sep="\t",index=False)
sigschweighted.sort(columns='p-value').to_csv("weighted_significant_SCH_communities.tsv",
                                             sep="\t",index=False)

Function-Schizophrenia crossover

This file summarises the crossover between the functions and the disease to summarise which functions are most likely to be involved with the disease.


In [75]:
overlap = pd.read_csv("unweightedOUT/overlap_dis_SCHenriched.csv", sep="\t",header=None,names=range(139))

In [76]:
overlap = overlap.fillna('')

We are only interested in the overlap with schizophrenia, so we want to cut out just that column.


In [77]:
overlap = overlap.iloc[:,[0,1,108]]

In [78]:
with open("unweightedOUT/overlap_dis_SCHenriched.csv") as f:
    c = csv.reader(f,delimiter="\t")
    overlap = {}
    for l in c:
        try:
            if l[0] == 'community':
                currentcommunity = l[1]
            if len(l) > 108:
                if l[108] == '1':
                    try:
                        overlap[currentcommunity] += [l[0]]
                    except KeyError:
                        overlap[currentcommunity] = [l[0]]
        except:
            pass

Ok, so we now have access to this data.